Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivet-site Ready Ready Preview Comment Nov 20, 2025 8:55pm
rivetkit-serverless Ready Ready Preview Comment Nov 20, 2025 8:55pm
2 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 20, 2025 8:55pm
rivet-inspector Ignored Ignored Preview Nov 20, 2025 8:55pm

Contributor Author

MasterPtato commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 18, 2025

Pull Request Review

Summary

This PR fixes a race condition where actor stops during WebSocket connections or HTTP requests weren't being properly handled, leading to requests hanging indefinitely. The fix adds subscriptions to the actor::Stopped workflow event and properly terminates pending operations when actors stop.


✅ Positive Aspects

  1. Critical Bug Fix: Addresses a real race condition where requests/WebSockets could hang if an actor stops during operation
  2. Consistent Error Handling: Uses appropriate error types (ServiceUnavailable, WebSocketServiceUnavailable) that match existing patterns
  3. Hibernation Awareness: Correctly handles the hibernation case for WebSockets - hibernating if allowed, otherwise returning unavailable
  4. Proper Logging: Good use of structured logging with tracing::debug! to track when actors stop during operations

🔍 Code Quality Observations

1. Duplicate Subscription Pattern

In lib.rs, the stopped_sub subscription is created in three places. Consider creating the subscription once at the beginning of handle_request and handle_websocket methods and reusing it throughout the function to reduce redundant subscription overhead.

2. Runner WebSocket Behavior Change (Important)

In pegboard-runner/src/lib.rs:246-252, the behavior was changed so WebSockets are NOT notified when the runner WS closes. This means WebSocket connections will remain open from the gateway perspective even when the runner disconnects, waiting only for the actor to stop. Recommendation: Add integration tests to verify this behavior works correctly.

3. Comment Accuracy

Line 385 comment changed from "Send reclaimed messages" to "Send pending messages" - good improvement in clarity! ✅


🐛 Potential Issues

1. Subscription Timing in Hibernation Flow

The stopped_sub is created at line 290 for WebSocket open handling, but if after_hibernation is true (line 310), the WebSocket open logic is skipped. Verify that the handle_websocket_hibernation_inner method subscribes to the Stopped event, or document why this race is acceptable.

2. Error Propagation During Cleanup

When sending the WebSocket close message fails, only an error log is produced. Should this error be propagated? If the close message fails to send, the runner might not clean up properly.


🔒 Security Considerations

No security issues identified. The changes properly handle error cases and don't introduce new attack vectors. ✅


⚡ Performance Considerations

  1. Subscription Overhead: Multiple subscriptions to the same event stream could add overhead. Consider optimizing by reusing subscriptions.
  2. HashMap Lookups: The has_pending_websocket_messages method performs a HashMap lookup on the hot path during hibernation.

🧪 Test Coverage

Major Gap: No tests are included with this PR for critical race condition fixes.

Recommendations:

  • Unit tests: Mock the stopped_sub to fire during various stages
  • Integration tests: Simulate actor stopping during active request/WebSocket
  • Edge cases: Actor stops immediately after subscription, multiple concurrent requests

🎯 Recommendations Summary

Must Address:

  1. Verify hibernation subscription coverage for stop events
  2. Add tests for the actor stop scenarios

Should Consider:

  1. Optimize subscription creation (reuse where possible)
  2. Clarify intended behavior for WebSocket when runner disconnects but actor continues
  3. Add integration tests for the new behavior

Final Verdict

This is a solid bug fix that addresses a real race condition. The implementation is clean and follows existing patterns. However, the lack of tests for such critical race condition fixes is concerning.

Recommended Action: Approve with the requirement that tests be added before merging, especially for the WebSocket-specific behavior changes.

Great work on identifying and fixing this race condition! 🚀

@MasterPtato MasterPtato force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from 6c5e080 to dc9084a Compare November 18, 2025 23:50
@MasterPtato MasterPtato force-pushed the 11-13-chore_rename_retry_-_hibernate_in_runner_protocol branch from e83ac9b to 00cd5f2 Compare November 18, 2025 23:50
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 18, 2025 23:50 Inactive
@claude

claude bot commented Nov 18, 2025

PR Review: fix(guard): handle actor stopped during ws/req

Summary

This PR adds handling for actor stopped events during WebSocket and HTTP request processing in the pegboard gateway, ensuring graceful cleanup when actors stop mid-request. The changes also improve WebSocket lifecycle management in the runner.

Code Quality ✅

Strengths:

  • Clear separation of concerns with stopped_sub subscription in both request and WebSocket handlers
  • Consistent error handling using appropriate error types (ServiceUnavailable, WebSocketServiceUnavailable, WebSocketServiceHibernate)
  • Follows existing patterns and conventions in the codebase
  • Good use of tokio::select! for concurrent event handling

Detailed Analysis

1. pegboard-gateway/src/lib.rs - Actor Stopped Handling

Lines 155-158, 290-293: Subscription creation

let mut stopped_sub = self
    .ctx
    .subscribe::<pegboard::workflows::actor::Stopped>(("actor_id", self.actor_id))
    .await?;

Good: Subscribes to actor stopped events before processing requests
⚠️ Consideration: The subscription happens after reading the request body. If the actor stops during body reading, it won't be detected. Consider moving the subscription earlier if this is a concern.

Lines 220-223, 356-359, 436-444: Select arms for stopped events

_ = stopped_sub.next() => {
    tracing::debug!("actor stopped while waiting for request response");
    return Err(ServiceUnavailable.build());
}

Good: Appropriate error types returned for each context
Good: Debug logging helps with troubleshooting
Good: Hibernation-aware logic in WebSocket handler (lines 436-444)
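The mapping from operation kind to error type described here can be written down as a small pure function. This is a minimal sketch, not the gateway's code: `Operation`, `StopError`, and `error_on_actor_stopped` are hypothetical names, and the variants only mirror the error names mentioned in this review.

```rust
/// Which kind of in-flight operation observed the actor stopping (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Operation {
    HttpRequest,
    WebSocket { can_hibernate: bool },
}

/// Stand-ins for the error names used in the review; the real types live in the gateway crates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StopError {
    ServiceUnavailable,
    WebSocketServiceUnavailable,
    WebSocketServiceHibernate,
}

/// Hypothetical helper: pick the error to return once a Stopped event arrives.
fn error_on_actor_stopped(op: Operation) -> StopError {
    match op {
        // HTTP requests are ephemeral, so they fail right away.
        Operation::HttpRequest => StopError::ServiceUnavailable,
        // A hibernatable WebSocket hibernates instead of being closed.
        Operation::WebSocket { can_hibernate: true } => StopError::WebSocketServiceHibernate,
        // Otherwise the WebSocket is reported as unavailable.
        Operation::WebSocket { can_hibernate: false } => StopError::WebSocketServiceUnavailable,
    }
}
```

Keeping the decision in one function like this would also make it easy to unit test without a running actor.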

2. pegboard-gateway/src/shared_state.rs - Pending Messages Check

Lines 347-357: New has_pending_websocket_messages method

pub async fn has_pending_websocket_messages(&self, request_id: RequestId) -> Result<bool> {
    let Some(req) = self.in_flight_requests.get_async(&request_id).await else {
        bail!("request not in flight");
    };

    if let Some(hs) = &req.hibernation_state {
        Ok(!hs.pending_ws_msgs.is_empty())
    } else {
        Ok(false)
    }
}

Good: Simple, focused method
Good: Proper error handling for missing requests
Good: Safe handling of optional hibernation state

Lines 609-616: Early wake optimization in handle_websocket_hibernation

// Immediately rewake if we have pending messages
if self
    .shared_state
    .has_pending_websocket_messages(unique_request_id.into_bytes())
    .await?
{
    return Ok(HibernationResult::Continue);
}

Excellent: Prevents unnecessary hibernation when messages are pending
Performance: Avoids spawning keepalive task unnecessarily
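For illustration, the pending-message check and the early wake can be modeled synchronously with a plain `HashMap`. This is a sketch under assumptions: the real code uses an async map and `bail!`, `RequestId` is assumed to be a 16-byte array, and `HibernationDecision` and `decide_hibernation` are hypothetical names.

```rust
use std::collections::HashMap;

// Assumed shape of the request ID; the real type is not shown in this review.
type RequestId = [u8; 16];

struct HibernationState {
    pending_ws_msgs: Vec<Vec<u8>>,
}

struct InFlightRequest {
    hibernation_state: Option<HibernationState>,
}

/// Hypothetical decision type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum HibernationDecision {
    Continue,
    Hibernate,
}

/// Synchronous model of the check: error if the request is unknown,
/// false if it has no hibernation state, otherwise whether messages are queued.
fn has_pending_websocket_messages(
    in_flight: &HashMap<RequestId, InFlightRequest>,
    request_id: &RequestId,
) -> Result<bool, String> {
    let req = in_flight
        .get(request_id)
        .ok_or_else(|| "request not in flight".to_string())?;
    Ok(req
        .hibernation_state
        .as_ref()
        .map_or(false, |hs| !hs.pending_ws_msgs.is_empty()))
}

/// Rewake immediately when messages are already pending; otherwise hibernate.
fn decide_hibernation(
    in_flight: &HashMap<RequestId, InFlightRequest>,
    request_id: &RequestId,
) -> Result<HibernationDecision, String> {
    if has_pending_websocket_messages(in_flight, request_id)? {
        return Ok(HibernationDecision::Continue);
    }
    Ok(HibernationDecision::Hibernate)
}
```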

3. pegboard-runner/src/lib.rs - WebSocket Lifecycle Changes

Lines 246-279: Simplified close message handling

// Send close messages to all remaining active requests
let active_requests = conn.tunnel_active_requests.lock().await;
for (request_id, req) in &*active_requests {
    // Websockets are not ephemeral like requests. If the runner ws closes they are not informed;
    // instead they wait for the actor itself to stop.
    if req.is_ws {
        continue;
    }
    // ... send ToServerResponseAbort
}

Good: Clear distinction between WebSocket and HTTP request lifecycle
Good: Comprehensive comment explains the design decision
Good: Removes complex WebSocket close logic that was previously trying to send close frames

Removed Code Analysis:

  • Removed lines 13, 246-262: Eliminated CloseFrame and CloseCode imports and complex close logic
  • Simplification: The new approach is cleaner - WebSockets wait for actor stop signal rather than runner disconnect
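The HTTP versus WebSocket distinction in the close loop can be isolated as a filter. A minimal sketch, assuming a simplified `ActiveRequest` with only an `is_ws` flag and integer request IDs; `requests_to_abort` is a hypothetical helper, not a function in the runner.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for an active tunnel request.
struct ActiveRequest {
    is_ws: bool,
}

/// Returns the IDs of requests that should receive an abort when the runner
/// connection closes. WebSockets are skipped: they wait for the actor to stop.
fn requests_to_abort(active: &BTreeMap<u32, ActiveRequest>) -> Vec<u32> {
    active
        .iter()
        .filter(|(_, req)| !req.is_ws)
        .map(|(id, _)| *id)
        .collect()
}
```

A `BTreeMap` is used here only so the result order is deterministic in a test.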

Potential Issues & Suggestions

Minor Issues

  1. Comment Typo (line 385):
// Send pending messages

Previous comment said "Send reclaimed messages" - good fix! ✅

  2. Subscription Timing (lines 155-158):
    The stopped_sub is created after reading the request body. If the actor stops during body reading (lines 148-153), it won't be detected immediately. Consider:
// Subscribe before reading body
let mut stopped_sub = self.ctx.subscribe::<...>(...).await?;

let body_bytes = req.into_body().collect().await
    .context("failed to read body")?
    .to_bytes();
  3. Error Propagation Consistency:
    All three stopped event handlers return errors appropriately, but consider if there are cases where you'd want to distinguish between "actor stopped" vs other ServiceUnavailable causes for metrics/logging.

Questions for Discussion

  1. Race Condition: Is there a race between the actor stopping and the subscription being established? If an actor stops between handler start and subscription creation, the event might be missed.

  2. Subscription Cleanup: The stopped_sub is dropped at the end of each function. Is this the desired behavior, or should subscriptions be maintained longer?

  3. Testing: Are there integration tests that verify:

    • Actor stops during request body reading?
    • Actor stops while waiting for response?
    • Actor stops during WebSocket message handling?
    • Actor stops during hibernation?

Performance Considerations

Good: The early wake check (lines 609-616) prevents unnecessary hibernation overhead
Good: Using tokio::select! for concurrent event handling is efficient
⚠️ Minor: Creating a new subscription for each request/WebSocket connection could add overhead at scale. Consider connection pooling or subscription reuse if this becomes a bottleneck.

Security Concerns

No security issues identified

  • Proper error handling prevents information leakage
  • No authentication/authorization changes
  • No unsafe code introduced

Test Coverage

⚠️ Missing Test Evidence: No test files were modified in this PR. Consider adding tests for:

  • Actor stopping during request handling
  • Actor stopping during WebSocket connection
  • Actor stopping during hibernation
  • Early wake when pending messages exist

Recommendations

  1. Add Integration Tests: Cover the new actor stopped scenarios
  2. Consider Subscription Timing: Move subscription creation before body reading if actor stop during body read is a concern
  3. Add Metrics: Consider adding metrics to track how often actors stop mid-request for operational visibility
  4. Documentation: Add a code comment explaining the actor stopped subscription pattern for future maintainers

Verdict

Approval: ✅ Approved with Suggestions

This is a solid PR that improves reliability by handling actor stopped events gracefully. The code follows existing patterns, has appropriate error handling, and includes a nice optimization for hibernation. The suggestions above are minor improvements that could be addressed in follow-up work.

Strengths:

  • Proper error handling
  • Good logging
  • Performance optimization (early wake)
  • Simplified WebSocket lifecycle

Action Items:

  • Consider adding integration tests
  • Consider moving subscription earlier to catch stops during body reading
  • Consider adding metrics for operational visibility

@claude

claude bot commented Nov 18, 2025

Code Review: PR #3481 - fix(guard): handle actor stopped during ws/req

Summary

This PR adds proper handling for actor stop events during WebSocket and HTTP request processing in the pegboard gateway. The changes ensure graceful cleanup when actors stop while requests are in-flight.

Code Quality ✅

Strengths:

  • Consistent error handling: Properly returns ServiceUnavailable for HTTP requests and WebSocketServiceUnavailable/WebSocketServiceHibernate for WebSocket connections based on hibernation capability
  • Structured logging: Uses structured logging correctly with tracing::debug!(?x, "message") pattern, following CLAUDE.md conventions
  • Lifecycle management: The optimization in handle_websocket_hibernation (lib.rs:609-616) to immediately rewake when pending messages exist is a good performance improvement
  • Code organization: Subscribes to the Stopped signal early in both handle_request and handle_websocket methods, ensuring proper cleanup

Potential Issues ⚠️

1. Subscription Created After Request Start (Medium Priority)

In handle_request (lib.rs:155-158), the stopped_sub subscription is created after reading the request body but before starting the in-flight request. If the actor stops between reading the body and subscribing, there's a small race condition window.

Recommendation: Move the subscription to the earliest point possible.

2. WebSocket: No Stopped Handling During Hibernation (Medium Priority)

In handle_websocket_hibernation_inner (lib.rs:668-699), the method only listens for Ready events and WebSocket messages, but doesn't handle the case where the actor stops during hibernation.

Recommendation: Add a stopped_sub to the hibernation select to handle actor stops during hibernation.

3. Runner-side: Changed Semantics (Low Priority)

In pegboard-runner/src/lib.rs:249-253, WebSockets no longer receive close messages when the runner WebSocket closes. Consider documenting this behavior more clearly.

Performance Considerations ✅

Positive:

  • Early check for pending messages in hibernation avoids unnecessary keepalive task spawn
  • Subscriptions are cleaned up properly via tokio::select patterns

Security Concerns ✅

No security issues identified. Error messages don't leak sensitive information.

Test Coverage ⚠️

Missing:

  • No test files found in pegboard-gateway package
  • This PR adds critical error handling logic that should be tested

Recommendations:

  1. Add integration tests for actor stop scenarios
  2. Test the race condition scenarios
  3. Verify that pending messages are properly handled when actor stops

Overall Assessment

This is a solid improvement that addresses a real gap in actor lifecycle handling. The code quality is good and follows repository conventions. The main concern is the lack of test coverage and a couple of edge cases in the subscription timing.


Approval Status: ⚠️ Request Changes (add tests + fix race conditions)

cc: @MasterPtato

🤖 Generated with Claude Code

@claude

claude bot commented Nov 19, 2025

PR Review: fix(guard): handle actor stopped during ws/req

Overview

This PR addresses an important issue where the gateway doesn't properly handle actor stop events during WebSocket and HTTP request processing. The implementation subscribes to actor Stopped events and gracefully terminates connections when actors are no longer available.

Code Quality & Best Practices

✅ Strengths

  1. Consistent Error Handling: The PR correctly uses appropriate error types:

    • ServiceUnavailable for HTTP requests
    • WebSocketServiceUnavailable for WebSocket connections (when hibernation not enabled)
    • WebSocketServiceHibernate for WebSocket connections (when hibernation enabled)
  2. Follows Repository Patterns:

    • Uses structured logging correctly (e.g., tracing::debug!("actor stopped...") without formatting into the message)
    • Adheres to the error handling patterns defined in the codebase
    • Properly uses tokio::select! for concurrent event handling
  3. Good Code Organization: The changes are well-scoped and localized to the affected modules

🔍 Potential Issues

1. Logic Error in is_ws_hibernate check (Critical)

Location: engine/packages/pegboard-gateway/src/lib.rs:562-566

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Issue: The logic appears inverted. Result::map_or_else takes two closures:

  • First closure runs on Err (the error case)
  • Second closure runs on Ok (the success case)

This means:

  • If lifecycle_res is Err(...), it checks is_ws_hibernate(err)
  • If lifecycle_res is Ok(...), it returns false

But the close frame is sent only when this condition is true, which means:

  • Close frames are sent for hibernation errors
  • Close frames are NOT sent for successful completions or non-hibernation errors

Expected behavior: Close frames should NOT be sent when hibernating, but SHOULD be sent otherwise.

Suggested fix:

// Send close frame to runner if NOT hibernating
if !lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Or more clearly:

let should_send_close = match &lifecycle_res {
    Err(err) => !is_ws_hibernate(err),
    Ok(_) => true,
};

if should_send_close {
    // ... send close message
}
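The two readings of the condition can be checked in isolation. A minimal sketch, assuming a stand-in `LifecycleError` enum and a stand-in `is_ws_hibernate` (the real function is imported from `proxy_service` and its signature is not shown here), and assuming the body of the `if` is what sends the close frame:

```rust
/// Stand-in error type; the real lifecycle error is not shown in this review.
#[derive(Debug)]
enum LifecycleError {
    Hibernate,
    Other,
}

/// Stand-in for the real `is_ws_hibernate` check.
fn is_ws_hibernate(err: &LifecycleError) -> bool {
    matches!(err, LifecycleError::Hibernate)
}

/// The condition as quoted: on `Result`, `map_or_else` calls the first closure
/// for `Err` and the second for `Ok`, so this is true only for hibernation errors.
fn condition_as_quoted(res: &Result<(), LifecycleError>) -> bool {
    res.as_ref().map_or_else(is_ws_hibernate, |_| false)
}

/// The suggested form: send the close frame unless the result is a hibernation error.
fn should_send_close(res: &Result<(), LifecycleError>) -> bool {
    match res {
        Err(err) => !is_ws_hibernate(err),
        Ok(_) => true,
    }
}
```

For every input the two functions return opposite values, which is why negating the original condition and the explicit `match` are equivalent fixes.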

2. Subscription Timing

Location: Multiple locations in lib.rs

The PR creates stopped_sub subscriptions after reading the request body (for HTTP) or after header extraction (for WebSocket). This creates a potential race condition:

let body_bytes = req.into_body().collect().await?; // Actor could stop here
let mut stopped_sub = self.ctx.subscribe::<...>().await?; // We'd miss it

Impact: If the actor stops between request parsing and subscription, the stop event would be missed and the request would timeout instead of failing immediately.

Suggested improvement: Subscribe to actor stopped events as early as possible, before any async operations.
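The ordering concern can be reproduced with a toy event bus. This is a sketch under a strong assumption: subscribers only see events published after they subscribed, with no replay. The real workflow subscription may behave differently, and `Bus` and `ActorEvent` are hypothetical types.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ActorEvent {
    Stopped,
}

/// Toy event bus with no replay: events reach only the current subscribers.
#[derive(Default)]
struct Bus {
    subscribers: Vec<Sender<ActorEvent>>,
}

impl Bus {
    fn subscribe(&mut self) -> Receiver<ActorEvent> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    fn publish(&self, event: ActorEvent) {
        for tx in &self.subscribers {
            // Ignore subscribers whose receiver was already dropped.
            let _ = tx.send(event);
        }
    }
}

/// Subscribes after the simulated body read; the stop happens during the read.
fn subscribe_after_body_read() -> bool {
    let mut bus = Bus::default();
    bus.publish(ActorEvent::Stopped); // actor stops while the body is being read
    let stopped_sub = bus.subscribe();
    stopped_sub.try_recv().is_ok()
}

/// Subscribes first, so a stop during the simulated body read is observed.
fn subscribe_before_body_read() -> bool {
    let mut bus = Bus::default();
    let stopped_sub = bus.subscribe();
    bus.publish(ActorEvent::Stopped); // actor stops while the body is being read
    stopped_sub.try_recv().is_ok()
}
```

Under this no-replay assumption, only the early subscriber observes the stop.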

3. Immediate Wake on Pending Messages

Location: engine/packages/pegboard-gateway/src/lib.rs:614-621

if self
    .shared_state
    .has_pending_websocket_messages(unique_request_id.into_bytes())
    .await?
{
    return Ok(HibernationResult::Continue);
}

Observation: This is good defensive programming that prevents unnecessary hibernation when messages are already pending. However, there's no explanation in comments about why this is needed.

Suggestion: Add a comment explaining this check prevents a race condition where messages arrive just as hibernation begins.

4. Missing Stopped Subscription in Hibernation Handler

Location: engine/packages/pegboard-gateway/src/lib.rs:673-705

The handle_websocket_hibernation_inner function subscribes to actor::Ready events but NOT to actor::Stopped events. This means if an actor stops during hibernation, the WebSocket will remain hibernated until either:

  • The actor becomes ready again (unlikely if stopped)
  • The client sends a message or closes

Suggested improvement: Also subscribe to Stopped events during hibernation and close the connection if received.
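A small state-machine sketch shows the effect of handling Stopped during hibernation. Everything here is hypothetical: the event and result enums, the `run_hibernation` helper, and the assumption that a client message or a Ready event wakes the connection while a client close ends it.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HibernationEvent {
    ActorReady,
    ActorStopped,
    ClientMessage,
    ClientClosed,
}

/// Hypothetical outcome type; only `Continue` is named in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HibernationResult {
    Continue,
    Close,
}

/// Processes events in order and returns the first outcome that ends hibernation,
/// or None if the connection is still hibernating after all events.
fn run_hibernation(events: &[HibernationEvent], handle_stopped: bool) -> Option<HibernationResult> {
    for event in events {
        match event {
            HibernationEvent::ActorReady | HibernationEvent::ClientMessage => {
                return Some(HibernationResult::Continue)
            }
            HibernationEvent::ClientClosed => return Some(HibernationResult::Close),
            // With the suggested subscription, a stop closes the connection.
            HibernationEvent::ActorStopped if handle_stopped => {
                return Some(HibernationResult::Close)
            }
            // Without it, the stop goes unnoticed and hibernation continues.
            HibernationEvent::ActorStopped => continue,
        }
    }
    None
}
```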

Performance Considerations

✅ Good Practices

  1. Efficient Subscription Management: The PR properly uses tokio::select! which is efficient for handling multiple async events concurrently.

  2. No Blocking Operations: All operations are properly async, preventing thread blocking.

⚠️ Minor Concerns

  1. Multiple Subscriptions: Each request/WebSocket creates a new subscription to actor stopped events. For actors with high request volume, this could create significant subscription overhead.
    • Impact: Likely minimal in practice since subscriptions are lightweight
    • Consideration: Monitor UPS subscription count if this becomes a bottleneck

Security Concerns

✅ No Major Issues

  • Error messages don't leak sensitive information
  • Proper error propagation without exposing internal state
  • No authentication/authorization bypasses introduced

Test Coverage

⚠️ Missing Tests

The PR doesn't include any tests for the new actor stopped handling. Consider adding integration tests for:

  1. HTTP Request: Actor stops while waiting for response
  2. WebSocket Open: Actor stops during WebSocket handshake
  3. WebSocket Active: Actor stops during active WebSocket connection
  4. WebSocket Hibernation: Actor stops while WebSocket is hibernated
  5. Hibernation with Pending Messages: Verify immediate wake behavior

Example test scenario:

#[tokio::test]
async fn test_actor_stopped_during_request() {
    // Setup: Create actor and start request
    // Action: Stop actor while request is in flight
    // Assert: Request returns ServiceUnavailable error
}

Runner Changes

✅ Improved Clarity

Location: engine/packages/pegboard-runner/src/lib.rs:246-279

The change to NOT send WebSocket close messages when the runner connection closes is correct:

// Websockets are not ephemeral like requests. If the runner ws closes they are not informed;
// instead they wait for the actor itself to stop.
if req.is_ws {
    continue;
}

This properly distinguishes between:

  • HTTP requests: Ephemeral, should be aborted when runner disconnects
  • WebSockets: Persistent, rely on actor lifecycle events

The comment clearly explains this distinction. Good!

Documentation

Website Changes

The changes to website/public/llms*.txt appear to be unrelated documentation cleanup. This is fine but might warrant a separate commit for clarity.

Overall Assessment

Status: ⚠️ Needs Changes

The PR addresses a real issue and the overall approach is sound, but there's a critical logic error in the is_ws_hibernate check that needs to be fixed before merging. Additionally, the race condition with subscription timing and missing test coverage should be addressed.

Required Changes:

  1. ✅ Fix the inverted logic in the is_ws_hibernate check (lines 562-566)

Recommended Changes:

  1. Subscribe to actor stopped events earlier to prevent race conditions
  2. Add stopped event handling during hibernation
  3. Add test coverage for the new functionality
  4. Add explanatory comments for the immediate wake optimization

Nice to Have:

  1. Consider separating the documentation cleanup into a separate commit
  2. Add performance monitoring for subscription overhead if deployed at scale

Great work identifying and fixing this issue! The core logic is solid, just needs the critical bug fix and some defensive improvements.

@NathanFlurry NathanFlurry force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from d34ad3e to 44b3d81 Compare November 20, 2025 22:32
@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR implements proper handling of actor stops during WebSocket and HTTP requests in the pegboard gateway. The changes ensure that when an actor stops, in-flight requests and WebSocket connections are gracefully terminated instead of hanging.

Positive Aspects

Core Fix is Sound: The addition of stopped_sub subscriptions to detect actor stop events is the correct approach. This properly handles the edge case where an actor stops while requests are being processed.

Consistent Error Handling: The use of ServiceUnavailable for HTTP requests and conditional error handling for WebSockets (hibernate-capable vs non-hibernate) is appropriate.

Optimization: The early check in handle_websocket_hibernation (lines 614-621) to immediately wake if there are pending messages is a good performance optimization.

Proper Visibility Change: Making is_ws_hibernate public allows proper detection of hibernation errors when deciding whether to send close frames.

Lifecycle Cleanup: The runner cleanup logic correctly skips WebSocket close messages when the runner itself disconnects, as WebSockets wait for actor stop events.

@claude

claude bot commented Nov 20, 2025

Issues and Concerns

🔴 Critical: Logic Error in Close Frame Condition

Lines 562-566 in pegboard-gateway/src/lib.rs have an inverted condition:

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Problem: map_or_else evaluates the first closure (checking hibernation) when the result is Err, and returns false for Ok values. This means:

  • Close frames are sent when hibernating (should NOT send)
  • Close frames are NOT sent when not hibernating (should send)

Expected: Close frames should be sent to the runner UNLESS we're hibernating.

Fix Option 1 - Negate the condition:

if !lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false) {

Fix Option 2 - Use clearer logic:

if lifecycle_res.as_ref().err().map_or(true, |e| !is_ws_hibernate(e)) {

@MasterPtato MasterPtato force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from 44b3d81 to d34ad3e Compare November 20, 2025 22:38
@claude

claude bot commented Nov 20, 2025

Medium: Race Condition Risk in Request Handler

Lines 155-158 subscribe to actor stop events AFTER reading the request body. If the actor stops between when the request arrives and when we subscribe, we might miss the stop event.

Recommendation: Move the subscription to before reading the body (before line 148) to ensure we catch all stop events.


Medium: Inconsistent Subscription Timing

In handle_request, the stopped subscription is created at line 155, after the body has been read, and in handle_websocket it is created at line 290, after header extraction. While this may work due to workflow event replay, it would be more consistent and defensive to create subscriptions as early as possible in both handlers.


Minor: Comment Accuracy

Line 385 comment changed from "Send reclaimed messages" to "Send pending messages". While "pending" is clearer, verify this change is intentional and accurate for the hibernation recovery flow.


Minor: Unrelated Documentation Changes

The changes to website/public/llms-full.txt and llms.txt appear unrelated to this PR (removing TypeDoc links, adding blog post URL). These should ideally be in a separate commit or PR for clarity.

@claude

claude bot commented Nov 20, 2025

Performance and Security Considerations

Performance

  • The subscriptions are properly cleaned up when the select branches complete
  • Early wake check prevents unnecessary hibernation cycles
  • No blocking operations introduced

Security

  • No security vulnerabilities introduced
  • Proper error handling prevents information leakage
  • Actor stop events are subscription-based (no polling)

Test Coverage

No tests found for pegboard-gateway or pegboard-runner packages. Consider adding integration tests for:

  • Actor stopping during HTTP request
  • Actor stopping during WebSocket open handshake
  • Actor stopping during active WebSocket connection
  • Hibernation behavior with pending messages

Recommendations

  1. URGENT: Fix the inverted close frame condition (lines 562-566)
  2. Move stopped subscription before body reading in handle_request
  3. Consider consistency in subscription timing across handlers
  4. Split unrelated documentation changes into separate commit
  5. Add integration tests for actor stop scenarios

Code Quality

Overall code follows repository conventions well:

  • ✅ Proper use of structured logging with tracing
  • ✅ Correct error types from rivet-error
  • ✅ Appropriate use of async/await patterns
  • ✅ Good separation of concerns

The core logic is sound, but the inverted condition is a critical bug that must be fixed before merge.

@claude

claude bot commented Nov 20, 2025

Code Review

This PR adds proper handling for actor stopped events during WebSocket and HTTP request processing in the pegboard gateway. The changes look solid overall, but I have a few observations and suggestions:

Positive Aspects

  1. Good defensive programming: Adding subscriptions to the actor::Stopped event prevents requests from hanging when an actor stops unexpectedly during processing.

  2. Consistent error handling: The code appropriately returns ServiceUnavailable for regular requests and WebSocketServiceUnavailable/WebSocketServiceHibernate for WebSocket connections based on hibernation capability.

  3. Early wake optimization: The check for pending messages before hibernating (lines 614-621 in lib.rs) is a smart optimization to avoid unnecessary hibernation cycles.

  4. Logical cleanup in runner: The removal of WebSocket close messages on runner disconnect (pegboard-runner/src/lib.rs:249-253) makes sense—WebSockets should wait for the actor itself to stop rather than the runner connection.

Issues and Concerns

1. Logic Error in Close Message Condition ⚠️ CRITICAL

Location: engine/packages/pegboard-gateway/src/lib.rs:562-566

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Problem: This condition is inverted. Currently it evaluates as follows:

  • lifecycle_res is Err(err) → true only if is_ws_hibernate(err) returns true, so the close message is sent
  • lifecycle_res is Ok(_) → false, so the close message is not sent

This means the close message is only sent when hibernating, which is the opposite of the intended behavior based on the comment "Send close frame to runner if not hibernating".

Expected behavior: The close message should be sent in all cases EXCEPT when hibernating.

Fix: Negate the condition:

if !lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Or more clearly:

let should_send_close = lifecycle_res
    .as_ref()
    .err()
    .map_or(true, |err| !is_ws_hibernate(err));

if should_send_close {

2. Missing Error Propagation Context

Location: engine/packages/pegboard-gateway/src/lib.rs:152-159, 290-293

The new .await? calls for subscribing to actor stopped events do not include context. Following the codebase pattern:

let mut stopped_sub = self
    .ctx
    .subscribe::<pegboard::workflows::actor::Stopped>(("actor_id", self.actor_id))
    .await
    .context("failed to subscribe to actor stopped events")?;

3. Inconsistent Import Style

Location: engine/packages/pegboard-gateway/src/lib.rs:17

The new import breaks the established pattern:

proxy_service::{ResponseBody, is_ws_hibernate},

Following Rust conventions, function imports should maintain their module path or use a more explicit import:

use rivet_guard_core::proxy_service::{self, ResponseBody};
// Then use: proxy_service::is_ws_hibernate(err)

Or if direct import is preferred, it should be on a separate line for clarity.

4. Potential Race Condition

Location: engine/packages/pegboard-gateway/src/lib.rs:614-621

The check for pending messages happens before starting the keepalive task. There is a potential race where:

  1. Check shows no pending messages
  2. A message arrives
  3. Hibernation starts
  4. Message is missed until the next wake

Consider moving this check after setting up the subscription or handling it within the select loop.
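One generic way to close a check-then-act window like the one listed above is to do the check and the state change under a single lock. The sketch below shows that technique only; it is not the gateway's design and not the option suggested above, and `ConnState`, `try_hibernate`, and `enqueue` are hypothetical names.

```rust
use std::sync::Mutex;

/// Hypothetical per-connection state guarded by one lock.
#[derive(Default)]
struct ConnState {
    pending_ws_msgs: Vec<Vec<u8>>,
    hibernating: bool,
}

/// Checks for pending messages and enters hibernation in one critical section.
/// Returns false (stay awake) if any message is already queued.
fn try_hibernate(state: &Mutex<ConnState>) -> bool {
    let mut guard = state.lock().unwrap();
    if !guard.pending_ws_msgs.is_empty() {
        return false;
    }
    guard.hibernating = true;
    true
}

/// Queues a message and reports whether the connection must be woken,
/// i.e. whether it was already hibernating when the message arrived.
fn enqueue(state: &Mutex<ConnState>, msg: Vec<u8>) -> bool {
    let mut guard = state.lock().unwrap();
    guard.pending_ws_msgs.push(msg);
    guard.hibernating
}
```

With this shape a message either lands before the check, so hibernation is refused, or after it, so `enqueue` reports that a wake is needed. No interleaving leaves a message unnoticed.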

5. Missing Test Coverage

The PR description shows 0 test-related changes. Given the critical nature of this fix (handling actor stopped during active requests), integration tests should be added to verify:

  • Request handling when actor stops mid-request
  • WebSocket handling when actor stops during open/message/close phases
  • Hibernation behavior when actor stops
  • Proper error types returned in each scenario

Performance Considerations

The additional subscriptions (stopped_sub) are created for every request/WebSocket connection. This is appropriate given the need to detect actor stopped events, but monitor subscription overhead in high-throughput scenarios.

Security

No security concerns identified. The changes maintain existing error handling patterns and do not introduce new attack surfaces.

Minor Suggestions

  1. Line 387 comment: "Send pending messages" is clearer than "Send reclaimed messages" ✓ (already fixed in PR)

  2. Logging consistency: Consider using ?actor_id in the stopped event log messages for consistency with other tracing calls in the codebase.

Summary

The core concept and implementation are sound, but the inverted condition on lines 562-566 is a critical bug that would cause close messages to be sent only during hibernation instead of in all other cases. Please fix this before merging.

The lack of test coverage for these critical paths is also concerning. At minimum, add tests for the actor stopped scenarios to prevent regressions.


Review generated with Claude Code following CLAUDE.md conventions

@NathanFlurry NathanFlurry force-pushed the 11-13-chore_rename_retry_-_hibernate_in_runner_protocol branch from 274d094 to d44f1d1 Compare November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from d34ad3e to 44b3d81 Compare November 20, 2025 23:32
@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a critical issue where actors could be stopped during WebSocket or HTTP request handling, potentially leaving clients hanging without proper error responses. The changes introduce proper actor lifecycle monitoring and graceful handling when actors are stopped mid-request.

✅ Positive Aspects

  1. Critical Bug Fix: The PR addresses an important edge case where actor stoppage wasn't being handled during active requests/WebSocket connections
  2. Consistent Error Handling: All three main code paths (request handling, WebSocket open, and WebSocket message loop) now consistently check for actor stoppage
  3. Appropriate Error Responses: Returns ServiceUnavailable for requests and either WebSocketServiceHibernate (if hibernation is enabled) or WebSocketServiceUnavailable (if not) for WebSockets
  4. Code Cleanup: Removed unnecessary close frame sending logic in pegboard-runner that was sending hibernation messages inappropriately
  5. Hibernation Optimization: The new has_pending_websocket_messages function allows for immediate rewake during hibernation if messages are pending, avoiding unnecessary hibernation cycles

🔍 Issues & Concerns

1. Logic Error in Close Frame Sending (lib.rs:563-566)

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Issue: This logic is inverted. On a Result, map_or_else takes two closures: the first for the Err case and the second for the Ok case. The current code:

  • When lifecycle_res is Err: calls is_ws_hibernate(err)
  • When lifecycle_res is Ok: returns false

This means the condition is true ONLY when the error IS a hibernate error, so the close frame is sent only while hibernating. But the condition should skip sending when it IS hibernating.

Expected behavior: You should send the close frame UNLESS we're hibernating. The correct logic should be:

if !lifecycle_res
    .as_ref()
    .err()
    .map(is_ws_hibernate)
    .unwrap_or(false)
{

Or more clearly:

let should_skip_close = lifecycle_res
    .as_ref()
    .err()
    .map(is_ws_hibernate)
    .unwrap_or(false);

if !should_skip_close {
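The difference between the two conditions is easiest to see on a small truth table. The sketch below uses stand-in types: `LifecycleError` and this version of `is_ws_hibernate` are hypothetical and only mirror the shape of the code quoted above, assuming the real helper takes a reference to the error.

```rust
/// Hypothetical stand-in for the gateway's error type (not the real one).
#[derive(Debug)]
enum LifecycleError {
    Hibernate,
    Unavailable,
}

/// Hypothetical stand-in for `is_ws_hibernate`.
fn is_ws_hibernate(err: &LifecycleError) -> bool {
    matches!(err, LifecycleError::Hibernate)
}

/// The condition as quoted from the PR: true only for `Err` holding a hibernate error.
fn condition_as_written(res: &Result<(), LifecycleError>) -> bool {
    res.as_ref().map_or_else(is_ws_hibernate, |_| false)
}

/// The suggested condition: true for `Ok` and for any non-hibernate error.
fn should_send_close(res: &Result<(), LifecycleError>) -> bool {
    !res.as_ref().err().map(is_ws_hibernate).unwrap_or(false)
}
```

For `Ok(())`, `Err(Hibernate)`, and `Err(Unavailable)`, `condition_as_written` yields false, true, false, while `should_send_close` yields true, false, true: the two are exact negations of each other.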

2. Visibility Change May Need Documentation (proxy_service.rs:2540)

The function is_ws_hibernate was changed from private to pub. While this is necessary for the fix, consider:

  • Is this part of a public API that external consumers might use?
  • Should this be documented as part of the module's public interface?

3. Race Condition Potential

The subscription to actor::Stopped happens after the request body is read. While this is a very small window, there's a theoretical race where:

  1. Actor stops
  2. Body is read
  3. Subscription is created (misses the stop event)
  4. Request proceeds as if actor is still running

Mitigation: Consider subscribing before reading the request body, or document why this ordering is safe.

4. Inconsistent Comment (lib.rs:385)

// Send pending messages

The comment was changed from "Send reclaimed messages" to "Send pending messages". While "pending" is clearer, consider whether this accurately describes what resend_pending_websocket_messages does (it resends messages that were buffered during hibernation).

🎯 Performance Considerations

  1. Multiple Subscriptions: Each request/WebSocket connection now creates a new subscription to the actor::Stopped event. This is reasonable for correctness, but monitor subscription overhead at scale.

  2. tokio::select! Overhead: Adding another branch to the select loops adds minimal overhead, but it's worth noting for hot paths.

🔒 Security Considerations

No major security concerns. The changes properly propagate error conditions rather than leaving connections in undefined states, which is a security improvement.

📝 Test Coverage

Critical: No tests found for pegboard-gateway or pegboard-runner packages. This PR would benefit greatly from:

  1. Unit tests for the new has_pending_websocket_messages function
  2. Integration tests simulating:
    • Actor stopping during HTTP request
    • Actor stopping during WebSocket open handshake
    • Actor stopping during WebSocket message exchange
    • Proper hibernation behavior when actor stops

Consider adding tests in a follow-up PR if not in this one.

📚 Code Style & Conventions

✅ Follows CLAUDE.md conventions:

  • Uses structured logging with tracing::debug
  • Lowercase log messages
  • Proper error handling with anyhow
  • No formatting changes (as instructed)

🎨 Suggestions for Improvement

  1. Extract magic numbers: u16::MAX / 2 in wrapping_gt could use a named constant
  2. Add tracing spans: The new subscription/select blocks could benefit from spans for debugging
  3. Consider metrics: Track how often actors are stopped mid-request for monitoring
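As an illustration of the first suggestion, a wrap-around comparison with a named threshold might look like the sketch below. The body of the real `wrapping_gt` is not shown in this review, so this implementation and the constant name `WRAPPING_HALF_RANGE` are assumptions.

```rust
/// Half of the u16 index space, used as the wrap-around threshold.
/// The name is hypothetical; the PR uses the literal `u16::MAX / 2`.
const WRAPPING_HALF_RANGE: u16 = u16::MAX / 2;

/// Returns true when `a` is ahead of `b` in wrapping u16 arithmetic.
/// Assumed implementation for illustration; the real function may differ.
fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < WRAPPING_HALF_RANGE
}
```

With this definition, `wrapping_gt(0, u16::MAX)` is true because index 0 follows index 65535 after wrap-around, while `wrapping_gt(u16::MAX, 0)` is false.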

🚀 Recommendation

Conditional Approval - The PR fixes an important issue and the approach is sound, but the logic error in lines 563-566 must be fixed before merging. Please:

  1. Fix the inverted logic for close frame sending
  2. Add a test case for actor stoppage during WebSocket handling
  3. Consider the race condition mitigation mentioned above

Once the logic error is addressed, this will be a solid improvement to the actor lifecycle handling.


Great work identifying and fixing this edge case! The consistent error handling across all code paths is particularly well done. 🎯

@claude

claude bot commented Nov 21, 2025

Code Review - PR #3481

Summary

This PR fixes a critical race condition where actors could be stopped during WebSocket or HTTP request handling, leading to requests hanging indefinitely. The solution subscribes to the actor::Stopped event and returns appropriate errors when the actor stops mid-request.

Positive Observations

  1. Excellent Race Condition Fix - The core issue is well-identified and properly addressed
  2. Proper Error Handling - Correctly differentiates between hibernatable and non-hibernatable WebSocket connections
  3. Clean Separation of Concerns - The pegboard-runner changes are correct
  4. Good Logging - Debug logs properly placed with clear messages

Critical Issues

Issue 1: Logic Error in Close Frame Condition (lines 562-566)

The condition appears inverted. The comment says send close frame if NOT hibernating, but the code will ONLY send close frames when there IS a hibernation error.

Current: if lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false)

This evaluates to true only when there is an error AND it is a hibernate error, which is backwards.

Issue 2: Missing Stopped Subscription During Hibernation (lines 673-705)

When a WebSocket is hibernating, the code listens for actor::Ready events but not actor::Stopped events. If the actor stops during hibernation, the WebSocket will hang indefinitely until the client disconnects.
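A hibernation wait that reacts to both events could be sketched as follows. This is a simplification under stated assumptions: the real code presumably selects over separate subscriptions, whereas this example merges both events into one standard-library channel, and `ActorEvent`, `HibernationOutcome`, and `wait_during_hibernation` are hypothetical names.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical merged event stream for the actor lifecycle.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ActorEvent {
    Ready,
    Stopped,
}

/// Hypothetical result of waiting while the WebSocket hibernates.
#[derive(Debug, PartialEq)]
enum HibernationOutcome {
    Resume,
    Close,
}

/// Block until the actor is ready again or has stopped.
fn wait_during_hibernation(events: &Receiver<ActorEvent>) -> HibernationOutcome {
    match events.recv() {
        Ok(ActorEvent::Ready) => HibernationOutcome::Resume,
        // A stop event, or a closed event source, ends the wait
        // instead of leaving the connection hanging.
        Ok(ActorEvent::Stopped) | Err(_) => HibernationOutcome::Close,
    }
}
```

Whether a stop during hibernation should close the connection or trigger some other recovery is a design decision for the authors; the point of the sketch is only that the wait has an exit path for the stopped case.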

Recommendations

  1. Fix the inverted close frame condition logic (critical)
  2. Add stopped_sub handling during hibernation (critical)
  3. Consider moving subscription creation earlier to avoid race conditions
  4. Add test coverage for actor stop scenarios
  5. Separate unrelated website documentation changes

Overall Rating

Needs Revision - The core fix is excellent, but issues 1 and 2 are critical bugs that could cause incorrect behavior in production.

@NathanFlurry NathanFlurry force-pushed the 11-13-chore_rename_retry_-_hibernate_in_runner_protocol branch from 274d094 to d44f1d1 Compare November 21, 2025 01:27
@NathanFlurry NathanFlurry force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from d34ad3e to 44b3d81 Compare November 21, 2025 01:27
@claude

claude bot commented Nov 21, 2025

Code Review

This PR effectively handles the case where an actor stops during WebSocket or HTTP request processing. The implementation is solid overall, with good error handling and proper subscription management. Here are my findings:

✅ Strengths

  1. Proper event subscription: The PR correctly subscribes to pegboard::workflows::actor::Stopped events in all three critical locations (HTTP requests, WebSocket open, and WebSocket message loops).

  2. Consistent error handling: Returns appropriate errors (ServiceUnavailable for HTTP, WebSocketServiceUnavailable or WebSocketServiceHibernate for WebSockets depending on hibernation support).

  3. Clean separation of concerns: The runner-side changes properly distinguish between ephemeral HTTP requests and persistent WebSocket connections.

  4. Hibernation optimization: The new has_pending_websocket_messages check in handle_websocket_hibernation is a smart optimization to immediately rewake if messages are queued.

  5. Fixed close frame logic: The conditional sending of close frames to runners (only when not hibernating) prevents unnecessary messages and properly uses the is_ws_hibernate helper.

🔍 Potential Issues

  1. Logic inversion in close frame condition (lib.rs:563-566)

    if lifecycle_res
        .as_ref()
        .map_or_else(is_ws_hibernate, |_| false)
    {

    This logic appears inverted. The condition reads: "if the result is a hibernate error OR if it's Ok return false". This means the close frame is sent when:

    • Result is Ok (any variant) → returns false → sends close frame
    • Result is Err and NOT hibernate → returns false → sends close frame
    • Result is Err and IS hibernate → returns true → skips close frame

    On a closer reading:

    • map_or_else(is_ws_hibernate, |_| false) means: if Err, call is_ws_hibernate(err); if Ok, return false
    • The condition is therefore true only when the result is a hibernate error

    The intent is "send the close frame if it's NOT a hibernate error", but the condition should state that more clearly. Consider:

    let should_send_close = !lifecycle_res
        .as_ref()
        .is_err_and(is_ws_hibernate);
    
    if should_send_close {
        // send close frame
    }
  2. Subscription cleanup (lib.rs:155-158, 290-293, 436)
    The stopped_sub subscriptions are created but never explicitly cleaned up. While Rust's drop semantics handle this, there are three subscriptions created per request/WebSocket:

    • One in handle_request (line 155)
    • One in handle_websocket (line 290)
    • One used in the spawned task (line 436)

    The spawned task subscription is moved into the task and shares the same lifetime, which is correct. However, consider documenting that subscriptions are automatically cleaned up on drop to make the resource management explicit.

  3. Removed unused import (pegboard-runner/lib.rs:15)
    Good cleanup removing the CloseCode import, but verify that all error paths that previously used it still work correctly.
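The drop-based cleanup mentioned in item 2 can be made concrete with a small sketch. The `Subscription` type here is hypothetical and only counts live handles; it is not the real subscription type and makes no claim about how the actual implementation releases resources.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical subscription handle that tracks how many are alive.
struct Subscription {
    active: Arc<AtomicUsize>,
}

impl Subscription {
    fn new(active: Arc<AtomicUsize>) -> Self {
        active.fetch_add(1, Ordering::SeqCst);
        Subscription { active }
    }
}

impl Drop for Subscription {
    /// Runs automatically when the handle goes out of scope.
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Stand-in for a request handler: returns the number of live
/// subscriptions observed while the handler is running.
fn handle_request(active: &Arc<AtomicUsize>) -> usize {
    let _stopped_sub = Subscription::new(Arc::clone(active));
    active.load(Ordering::SeqCst)
    // `_stopped_sub` is dropped here, on every return path.
}
```

The counter reads 1 inside the handler and returns to 0 once it exits, which is the behavior a short comment next to each `stopped_sub` could document.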

🎯 Best Practices

  1. Logging: All the new error paths have appropriate debug-level logs ("actor stopped while waiting for..."). Excellent for debugging. ✅

  2. Error messages: Clear and descriptive error messages. ✅

  3. Following CLAUDE.md conventions:

    • Uses structured logging with tracing correctly (?x for debug formatting) ✅
    • Lowercase log messages ✅
    • Proper error handling with Result return types ✅

🚀 Performance Considerations

  1. Subscription overhead: Creating three UPS subscriptions per connection adds some overhead, but this is necessary for correctness. The subscriptions are lightweight and properly scoped.

  2. Early wake optimization: The has_pending_websocket_messages check is a good optimization to avoid unnecessary hibernation cycles.

🔒 Security

No security concerns identified. The changes maintain proper error boundaries and don't expose internal state.

🧪 Test Coverage

Concern: No tests were added for this new functionality. Consider adding integration tests that:

  1. Simulate an actor stopping during an HTTP request
  2. Simulate an actor stopping during WebSocket open
  3. Simulate an actor stopping during active WebSocket communication
  4. Verify hibernating WebSockets properly handle actor stops

📝 Minor Suggestions

  1. Consider using is_err_and for the close frame condition to improve readability
  2. Add a comment explaining why WebSockets don't receive close messages when the runner disconnects (as noted in pegboard-runner/lib.rs:249-252)
  3. The website/public/llms*.txt changes appear unrelated to this PR - consider removing them

Summary

Recommendation: Approve with minor suggestions

The core functionality is correct and handles the critical race condition properly. The main concern is test coverage, and there's a minor readability improvement opportunity in the close frame condition. The changes follow the codebase conventions well and integrate cleanly with the existing error handling patterns.

@graphite-app
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch November 21, 2025 02:43